Probability Theory
Table of Contents
- 1. Probability
- 2. Probability Space
- 3. Random Variable
- 4. Probability Distribution
- 4.1. Properties
- 4.2. Probability Mass Function
- 4.3. Probability Density Function
- 4.4. Cumulative Distribution Function
- 4.5. Normalization and Denormalization
- 4.6. Combination
- 4.7. Instances
- 5. Parametric Family
- 6. Stochastic Process
- 7. References
- Probability Calculus
1. Probability
1.1. Marginal Probability
- Probability distribution of a subset of a larger collection of random variables.
1.2. Conditional Probability
- Probability contingent upon the values of the other variables.
- \[ p_{Y|X}(y\mid x) := \mathrm{P}[Y=y\mid X=x] = \frac{\mathrm{P}(\{X=x\}\cap \{Y=y\})}{\mathrm{P}(\{X=x\})} \]
- \[ f_{Y|X}(y\mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} \]
1.3. Law of Total Probability
Relation between marginal probability and conditional probability. \[ \mathrm{P}(A) = \sum_k \mathrm{P}(A\cap B_k) = \sum_k \mathrm{P}(A\mid B_k)\mathrm{P}(B_k) \] Generally, \[ \mathrm{P}(A) = \int_\Omega\mathrm{P}(A\mid X)\,\mathrm{dP}. \]
Further, \[ \mathrm{P}(A\mid B) = \sum_n \mathrm{P}(A \mid C_n)\,\mathrm{P}(C_n\mid B). \]
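- A minimal numeric sketch of the law of total probability over a finite partition; the probabilities below are hypothetical values chosen for illustration:
```python
# Law of total probability over a finite partition {B_k} (hypothetical numbers).
p_B = [0.5, 0.3, 0.2]          # P(B_k); the B_k partition the sample space
p_A_given_B = [0.9, 0.5, 0.1]  # P(A | B_k)

# P(A) = sum_k P(A | B_k) * P(B_k)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
print(p_A)  # 0.62
```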
1.4. Bayesian Probability
- Interpretation of probability as reasonable expectation, instead of frequency or propensity.
- It represents a state of knowledge or a quantification of personal belief.
2. Probability Space
2.1. Definition
- A measure space such that the measure of the whole space is one.
- It is a triple \((\Omega, \Sigma, \mathrm{P})\) consisting of
- The sample space \(\Omega\): an arbitrary non-empty set,
- The event space \(\Sigma\): σ-algebra on \(\Omega\),
- \(\Sigma\) stands for σ-algebra. \(\mathcal{F}\) (as in filtration) or \(\mathcal{A}\) are also often used by convention.
- The probability measure \(\mathrm{P}: \Sigma \to [0, 1]\).
2.2. Probability Measure
2.2.1. Definition
- Probability measure \(\mathrm{P}\) is a measure over \((\Omega, \Sigma)\), with:
- \(\mathrm{P}: \Sigma \to [0, 1]\), with \(\mathrm{P}(\varnothing) = 0\) and \(\mathrm{P}(\Omega) = 1\).
- Countable Additivity: \[ \mathrm{P}\left(\bigcup_{i\in \mathbb{N}}E_i\right) = \sum_{i\in \mathbb{N}}\mathrm{P}({E_i}) \] where \(\{E_i\}\) are pairwise disjoint sets.
- The validity of this definition of a probability measure is precisely given by the Kolmogorov axioms:
- Non-negativity
- Unit measure
- σ-additivity
2.2.2. Notations
- Probability that a random variable \(X\) takes a value in a measurable set \(S\subseteq E\) is written as \[ \mathrm{P}[X\in S] := \mathrm{P}(\{\omega \in \Omega\mid X(\omega)\in S\}). \]
3. Random Variable
- Kolmogorov Definition
- A random variable is a measurable function from the sample space to a measurable space \(E\), often taken to be \(\mathbb{R}\): \[ X: \Omega \to E. \]
4. Probability Distribution
- A probability distribution forgets the underlying probability space, and only remembers the output values of a random variable.
4.1. Properties
4.1.1. Mean
4.1.2. Variance
4.1.3. Skewness
- Korean: 왜도.
4.1.4. Kurtosis
- Korean: 첨도.
4.1.5. Absolutely Continuous
- A distribution that admits a probability density function, rather than a mass function on countably many points.
- Random variable \(X\) is absolutely continuous if there exists a function \(f_X\) such that for each interval \([a,b] \subseteq \mathbb{R}\): \[ \mathrm{P}[a\le X \le b] = \int_a^b f_X(x)\,dx \]
4.2. Probability Mass Function
- The probability mass function \(p_X(x)\) is defined as \[ p_X(x) := \mathrm{P}[X=x]. \]
4.3. Probability Density Function
4.3.1. Definition
- The probability density function of a random variable \(X\) that takes values in a measure space \((E, \Sigma, \mu)\) is the Radon-Nikodym derivative of the pushforward measure \(X_*\mathrm{P}\) with respect to the reference measure \(\mu\): \[ f_X(x) := \frac{d X_*\mathrm{P}}{d\mu} \]
- That is, \[ X_*\mathrm{P}(A) = \int_A f_X\,d\mu \] for any measurable set \(A\in \Sigma\).
4.3.2. Properties
- For a real-valued random variable with an absolutely continuous univariate distribution, the density is the derivative of the cumulative distribution function:
- \[ f_X(x) = \frac{dF_X}{dx}, \]
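- A quick numerical check of \(f_X = dF_X/dx\), using the standard normal purely as a concrete example:
```python
# Sketch: f_X = dF_X/dx, checked for the standard normal by a central difference.
import math

def F(x):  # standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(x):  # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

x, h = 0.7, 1e-5
dF = (F(x + h) - F(x - h)) / (2.0 * h)  # numerical derivative of the CDF
print(dF, f(x))  # the two values agree to roughly 1e-10
```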
4.4. Cumulative Distribution Function
- The cumulative distribution function of a real-valued random variable is \[ F_X(x) := \mathrm{P}[X\le x]. \]
4.5. Normalization and Denormalization
- The random variable \(X\) transforms contravariantly, and the underlying probability density transforms covariantly:
- \[ Z = \frac{X - \mu}{\sigma} \]
- \[
f_Z(x) = \sigma f_X(\sigma x + \mu)
\]
- The \(\sigma\) factor is the normalization constant, to compensate for \(f_X\) being scaled down in the \(x\) direction by the factor of \(\sigma\).
- Denormalization is the inverse transform:
- \[ X = \sigma Z + \mu \]
- \[ f_X(x) = \frac{1}{\sigma}f_Z\left(\frac{x - \mu}{\sigma}\right) \]
- The random variable and the probability density transform oppositely; see the sketch below.
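- A sketch of the two transforms above, assuming \(X \sim \mathcal{N}(\mu, \sigma^2)\) only for concreteness; the histogram of the normalized samples matches \(\sigma f_X(\sigma x + \mu)\):
```python
# Normalization Z = (X - mu)/sigma versus the covariant density transform
# f_Z(x) = sigma * f_X(sigma * x + mu). X ~ N(mu, sigma^2) is an assumption.
import numpy as np

mu, sigma = 3.0, 2.0
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, size=200_000)
z = (x - mu) / sigma                       # normalized (contravariant) variable

def f_X(t):                                # density of X
    return np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

hist, edges = np.histogram(z, bins=100, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
pred = sigma * f_X(sigma * centers + mu)   # covariant density transform
print(np.max(np.abs(hist - pred)))         # small, up to Monte Carlo noise
```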
4.6. Combination
4.6.1. Distribution of Sum
- For independent \(X\) and \(Y\): \[ f_{X+Y}(x) = (f_X * f_Y)(x) \]
- where \(*\) is the convolution.
- List of convolutions of probability distributions - Wikipedia
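- A minimal numerical sketch of the convolution formula, using two independent \(\mathrm{Uniform}(0,1)\) variables, whose sum has the triangular density on \([0,2]\):
```python
# Sketch: density of a sum of independent variables as a convolution.
# Two independent Uniform(0,1) variables give the triangular density on [0, 2].
import numpy as np

dx = 0.001
x = np.arange(0.0, 1.0, dx)
f = np.ones_like(x)                   # Uniform(0,1) density on its support
conv = np.convolve(f, f) * dx         # numerical convolution (f_X * f_Y)
grid = np.arange(len(conv)) * dx      # support of X + Y, approximately [0, 2)

print(np.interp(0.5, grid, conv))     # ≈ 0.5 on the rising edge
print(np.interp(1.0, grid, conv))     # ≈ 1.0 at the peak of the triangle
```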
4.6.2. Product Distribution
- \[ f_{XY}(z) = \int_{-\infty}^\infty f_{X,Y}\left(x, \frac{z}{x}\right)\frac{1}{|x|}\,dx \]
- Distribution of the product of two random variables - Wikipedia
4.6.3. Ratio Distribution
- \[ f_{X/Y}(z) = \int_{-\infty}^\infty f_{X,Y}(zy, y)\,|y|\,dy \]
- Ratio distribution - Wikipedia
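- A Monte Carlo sketch of a classic instance of the ratio-distribution formula: the ratio of two independent standard normals is standard Cauchy:
```python
# Monte Carlo sketch: the ratio of two independent standard normals is
# standard Cauchy; quantiles are compared, which is robust to the heavy tails.
import numpy as np

rng = np.random.default_rng(1)
r = rng.standard_normal(500_000) / rng.standard_normal(500_000)

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
print(np.quantile(r, p))           # sample quantiles of the ratio
print(np.tan(np.pi * (p - 0.5)))   # standard Cauchy quantile function
```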
4.7. Instances
- The unexpected probability result confusing everyone - YouTube
- For independent random variables \(X\) and \(Y\), uniformly distributed on \([0,1]\), the probability distribution of \(\max(X,Y)\) is equal to the probability distribution of \(\sqrt{X}\).
- Similarly, the probability distribution of \(\max(X_1, X_2,\dots, X_n)\) is equal to the probability distribution of \(\sqrt[n]{X_1}\).
- Shockingly, the probability distribution of \((XY)^Z\) is again uniform, for independent uniformly distributed random variables \(X,Y,Z\); see the Monte Carlo sketch below.
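- A Monte Carlo sketch of the instances above, with \(X, Y, Z\) independent and uniform on \([0,1]\):
```python
# Monte Carlo sketch of the instances above: X, Y, Z independent Uniform(0,1).
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
X, Y, Z = rng.random(n), rng.random(n), rng.random(n)

q = [0.25, 0.5, 0.75]
# max(X, Y) and sqrt(X) share the CDF F(t) = t^2 on [0, 1]
print(np.quantile(np.maximum(X, Y), q))  # ≈ [0.5, 0.707, 0.866]
print(np.quantile(np.sqrt(X), q))        # same values

# (X Y)^Z is again Uniform(0,1): uniform quantiles are the identity
print(np.quantile((X * Y) ** Z, q))      # ≈ [0.25, 0.5, 0.75]
```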
4.7.1. Discrete Distributions
4.7.1.1. Bernoulli Distribution
4.7.1.2. Binomial Distribution
4.7.1.3. Multinomial Distribution
- Higher dimensional binomial distribution.
- \[
f(x_1,\dots,x_k;n,p_1,\dots,p_k) = \binom{n}{x_1,\dots,x_k}\prod_{i=1}^k p_i^{x_i}
\]
- using the multinomial coefficient.
- Multinomial distribution - Wikipedia
4.7.1.4. Poisson Distribution
- \[ f(k;\lambda) = \frac{\lambda^ke^{-\lambda}}{k!} \]
- Poisson distribution - Wikipedia
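- A minimal check that the pmf sums to one and has mean \(\lambda\):
```python
# Sketch: the Poisson pmf f(k; lambda) = lambda^k e^{-lambda} / k! sums to 1,
# and its mean equals lambda.
import math

lam = 3.0
pmf = [lam ** k * math.exp(-lam) / math.factorial(k) for k in range(50)]
print(sum(pmf))                               # ≈ 1.0
print(sum(k * p for k, p in enumerate(pmf)))  # ≈ lambda = 3.0
```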
4.7.1.5. Geometric Distribution
4.7.2. Continuous Distributions
4.7.2.1. Normal Distribution
- Gaussian Distribution
- \(\mathcal{N}(\mu, \sigma^2)\)
4.7.2.1.1. Probability Density Function
- \[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
- Normalizing the area of the Gaussian function:
\[
e^{-x^2} \rightsquigarrow \frac{1}{\sqrt{\pi}}e^{-x^2}
\]
- This was Carl Friedrich Gauss's definition of the standard normal.
- It has a standard deviation of \(1/\sqrt{2}\).
- Denormalizing the probability distribution to mean \(\mu\) and standard deviation \(\sigma\): \[ \frac{1}{\sqrt{\pi}}e^{-x^2} \rightsquigarrow \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}. \]
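- A numerical check that the denormalized density integrates to one with the stated mean and standard deviation (the values of \(\mu\) and \(\sigma\) are arbitrary):
```python
# Numerical check of the denormalized Gaussian density: unit area, mean mu,
# standard deviation sigma, via a simple Riemann sum.
import numpy as np

mu, sigma = 1.5, 0.8
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200_001)
dx = x[1] - x[0]
f = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

print((f * dx).sum())                           # ≈ 1.0 (unit area)
print((x * f * dx).sum())                       # ≈ mu
print(np.sqrt(((x - mu) ** 2 * f * dx).sum()))  # ≈ sigma
```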
4.7.2.2. Chi-Squared Distribution
4.7.2.2.1. Definition
- For independent, standard normal random variables \(Z_1, \dots, Z_k\), \[ Q = \sum_{i=1}^k Z_i^2 \] is distributed according to the chi-squared distribution with \(k\) degrees of freedom: \[ Q \sim \chi^2(k). \]
- This distribution arises in the least-squares method.
- Chi-squared distribution - Wikipedia
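- A Monte Carlo sketch of the definition: the sum of \(k\) squared standard normals has mean \(k\) and variance \(2k\), as expected of \(\chi^2(k)\):
```python
# Monte Carlo sketch: the sum of k squared standard normals has mean k and
# variance 2k, matching the chi-squared distribution with k degrees of freedom.
import numpy as np

rng = np.random.default_rng(3)
k = 5
q = (rng.standard_normal((200_000, k)) ** 2).sum(axis=1)
print(q.mean(), q.var())  # ≈ k = 5 and 2k = 10
```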
4.7.2.3. F-Distribution
4.7.2.4. Student's T-Distribution
- T-Distribution
- The name "Student" comes from the pen name of William Sealy Gosset.
- It has fat tails. The shape of the t-distribution approaches the standard normal distribution as the sample size increases.
- It is a parametric family \(t_{\rm DF}\) with respect to the degrees of freedom, which is directly related to the sample size.
4.7.2.5. Cauchy Distribution
- Lorentz Distribution, Cauchy-Lorentz Distribution, Lorentzian Function, Breit-Wigner Distribution
- \[ f(x; x_0, \gamma) = \frac{1}{\pi\gamma\left[1+\left(\frac{x-x_0}{\gamma}\right)^2\right]}. \]
- Its mean is undefined.
4.7.2.6. Exponential Distribution
- Negative Exponential Distribution
- In terms of rate \(\lambda\):
- \[ f(x;\lambda) = \lambda e^{-\lambda x} \]
- with \(f(x;\lambda) = 0\) if \(x<0\).
- In terms of scale parameter \(\beta = 1/\lambda\):
- \[ f(x;\beta) = \frac1\beta e^{-x/\beta} \]
- The distance between consecutive events in a Poisson point process.
- Continuous Geometric Distribution
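- A simulation sketch of the Poisson-process connection: gaps between consecutive events of a rate-\(\lambda\) Poisson point process are \(\mathrm{Exponential}(\lambda)\):
```python
# Sketch: gaps between consecutive events of a simulated Poisson point process
# of rate lambda on [0, T] are Exponential(lambda), with mean 1/lambda.
import numpy as np

rng = np.random.default_rng(4)
lam, T = 2.0, 50_000.0
n = rng.poisson(lam * T)             # total number of events on [0, T]
times = np.sort(rng.random(n) * T)   # given n, event times are iid uniform
gaps = np.diff(times)

print(gaps.mean(), 1 / lam)                      # ≈ 0.5 both
print(np.quantile(gaps, 0.5), np.log(2) / lam)   # median = ln(2)/lambda
```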
4.7.2.7. Beta Distribution
4.7.2.7.1. Definition
- \[
f(x; \alpha, \beta) := \frac{x^{\alpha-1}(1-x)^{\beta-1}}{\mathrm{B}(\alpha, \beta)}
\]
- where \(\mathrm{B}(\alpha,\beta)\) is the beta function.
4.7.2.7.2. Properties
- It is the probability distribution of the estimator \(\hat{p}\) of
the probability of observing a positive event after observing
\(\alpha-1\) positive events and \(\beta-1\) negative events.
- \[ \hat{p} = \frac{\alpha-1}{\alpha + \beta-2} \sim \mathcal{Be}(\alpha, \beta) \]
- \[ \mathcal{Be}(\alpha, \beta) = \mathrm{P}[X_{\alpha+\beta-1} = \omega_+ \mid X_{\sigma(1)} = \cdots = X_{\sigma(\alpha-1)} = \omega_+, X_{\sigma(\alpha)} =\cdots = X_{\sigma(\alpha+\beta - 2)} = \omega_-] \]
- where \(X_i\)s are independent and identically distributed (iid) random variables, with unknown probability distribution.
- The harmonic mean is symmetric under exchanging \(\alpha \leftrightarrow \beta\) together with \(X \leftrightarrow 1-X\):
\[
H_X(\alpha, \beta) = H_{1-X}(\beta, \alpha) = \frac{\alpha - 1}{\alpha + \beta - 1} \quad (\alpha > 1)
\]
- where \(H_X\) is defined to be \[ H_X := \frac{1}{\mathrm{E}\left[\frac{1}{X}\right]} \]
- Concentration \(\kappa := \alpha+\beta\)
- Mode \[ \omega = \frac{\alpha-1}{\alpha+\beta -2} \]
- Variance
\[
\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha+\beta +1)}
\]
- It is asymptotically equal to the variance of the sample mean \(\bar{x}\) of Bernoulli-distributed random variables; see the Monte Carlo sketch below. \[ \frac{\hat{p}\hat{q}}{n} = \widehat{\sigma^2[\hat{p}]}, \quad\sigma^2[\bar{x}]= \sigma^2[\hat{p}] \]
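- A Monte Carlo sketch of the properties above: the variance formula and the closed form \(H_X = \frac{\alpha-1}{\alpha+\beta-1}\) of the harmonic mean (valid for \(\alpha > 1\)):
```python
# Monte Carlo sketch of the Beta(alpha, beta) properties above: the variance
# formula and the closed form of the harmonic mean (requires alpha > 1).
import numpy as np

rng = np.random.default_rng(5)
a, b = 3.0, 5.0
x = rng.beta(a, b, size=1_000_000)

print(x.var(), a * b / ((a + b) ** 2 * (a + b + 1)))  # ≈ 0.0260 both
print(1 / (1 / x).mean(), (a - 1) / (a + b - 1))      # H_X ≈ 2/7 both
```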
4.7.2.7.3. Beta Prime Distribution
- The probability distribution of the estimator of odds.
5. Parametric Family
- Explaining Parametric Families - YouTube
- A parametric family is a statistical model of an unknown probability distribution, on which one can do analysis.
6. Stochastic Process
- A sequence of random variables \( X = (X_i)_i \).
6.1. Quadratic Variation
- One kind of variation of a stochastic process.
6.1.1. Definition
\[ [X]_t = \lim_{\Vert P\Vert \to 0} \sum_{k=1}^n(X_{t_k} - X_{t_{k-1}})^2 \] where \(P = \{0 = t_0 < t_1 < \cdots < t_n = t\}\) is a partition of the interval \([0, t]\) and \(\Vert P\Vert\) is its mesh.
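- A simulation sketch: the realized quadratic variation of a Wiener path on \([0, t]\) approaches \(t\) as the partition is refined (anticipating the Wiener process below):
```python
# Sketch: the realized quadratic variation of a simulated Wiener path on [0, t]
# approaches t as the partition mesh shrinks.
import numpy as np

rng = np.random.default_rng(6)
t = 1.0
for n in (100, 10_000, 1_000_000):
    dW = rng.standard_normal(n) * np.sqrt(t / n)  # increments W_{t_k} - W_{t_{k-1}}
    print(n, (dW ** 2).sum())                     # → t = 1.0 as n grows
```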
6.1.2. Covariation
- Cross-Variation
6.1.2.1. Definition
\[ [X, Y]_t := \lim_{\Vert P\Vert \to 0}\sum_{k=1}^{n}(X_{t_k} - X_{t_{k-1}})(Y_{t_k} - Y_{t_{k-1}}) \]
6.1.2.2. Properties
- By the polarization identity: \[ [X, Y]_t = \frac{1}{2}([X + Y]_t - [X]_t - [Y]_t). \]
- For semimartingales: \[ d(X_tY_t) = X_{t-}\,dY_t + Y_{t-}\,dX_t + dX_t\,dY_t, \] where \(dX_t\,dY_t := d[X, Y]_t\).
6.2. Martingale
A stochastic process is called a martingale if the expected value of the immediate future is equal to the value of the current variable.
6.2.1. Definition
A discrete-time stochastic process \( X \) is a martingale if, for any time \(n\):
- \[ \operatorname{E}[|X_n|] < \infty, \]
- \[ \operatorname{E}[X_{n+1}\mid X_1,\dots,X_n ] = X_n. \]
Generally, a stochastic process \(Y: T\times \Omega \to S\), where \(S\) is a Banach space, is a martingale with respect to a filtration \(\Sigma_{*}\) and probability measure \(\mathrm{P}\), if:
- \( \Sigma_{*} \) is a filtration of the event space \(\Sigma\) of the underlying probability space \((\Omega,\Sigma, \mathrm{P})\).
- \(Y\) is adapted to the filtration \(\Sigma_*\), that is, for each \(t\) in the index set \(T\), the random variable \(Y_t\) is a \( \Sigma_t \)-measurable function.
- For each \(t\), \(Y_t\) lies in the \(L^p\) space \(L^1(\Omega, \Sigma_t, \mathrm{P}; S)\): \[ \operatorname{E}[\Vert Y_t \Vert_S] < \infty. \]
- For all \(s\) and \(t\) with \(s < t\): \[ \operatorname{E}[Y_t\mid \Sigma_s] = Y_s. \]
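- A Monte Carlo sketch of the discrete-time definition: a symmetric simple random walk is a martingale, so conditioning on \(X_n\) leaves the mean of \(X_{n+1}\) unchanged:
```python
# Sketch: a symmetric simple random walk is a discrete-time martingale.
# Conditioning on the walk's value at step 10 leaves the mean at step 11 there.
import numpy as np

rng = np.random.default_rng(7)
steps = rng.choice([-1, 1], size=(500_000, 20))
X = steps.cumsum(axis=1)   # X_n = sum of the first n steps

sel = X[:, 9] == 4         # condition on X_10 = 4
print(X[sel, 10].mean())   # ≈ 4, the martingale property, up to Monte Carlo error
```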
6.2.2. Local Martingale
A \( \Sigma_{*} \)-adapted stochastic process \( X \) is called a \( \Sigma_* \)-local martingale if there exists a sequence of \( \Sigma_* \)-stopping times \( \tau_k\colon \Omega \to [0, \infty) \) such that
- the \( \tau_k \) are almost surely increasing: \( \mathrm{P}[\tau_k < \tau_{k+1}] = 1 \),
- the \( \tau_k \) diverge almost surely: \( \mathrm{P}[ \lim_{k\to \infty} \tau_k = \infty ] = 1 \),
- the stopped process \( X_t^{\tau_k} := X_{\min\{t, \tau_k\}} \) is a \( \Sigma_{*} \)-martingale for every \( k \).
Each \( \tau_k \) indicates a single stopping criterion, for example, "if the value of the stochastic process drops below zero".
6.2.3. Semimartingale
A real-valued stochastic process \( X \) is called a semimartingale if it can be decomposed into a local martingale \( M \) and a càdlàg adapted process \( A \) of locally bounded variation: \[ X_t = M_t + A_t. \]
6.3. Markov Chain
- Markov Process
A discrete-time Markov chain is a sequence of random variables \( (X_1, X_2, X_3, \dots ) \) where each \( X_{i}\colon \Omega \to S \) takes a value, called a state, in the state space \( S \), with \( X_{i+1} \) dependent only on the previous variable \( X_i \).
In the continuous-time Markov chain, the index can also take a continuous value \( t \ge 0 \). In this case, the transition-rate matrix \( \mathbf{Q} \) is defined entrywise: \[ \mathbf{Q}_{ij} := \lim_{h\to 0^+} \frac{\mathrm{P}[X_{t+h} = j \mid X_t = i] - \delta_{ij}}{h}. \] The matrix is also known as a Q-matrix, intensity matrix, or infinitesimal generator matrix.
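A minimal sketch of a two-state discrete-time chain with a hypothetical transition matrix; iterating the matrix converges to the stationary distribution:
```python
# Sketch of a two-state discrete-time Markov chain (hypothetical transition
# matrix); iterating the matrix converges to the stationary distribution.
import numpy as np

P = np.array([[0.9, 0.1],    # P[X_{i+1} = j | X_i = i], rows sum to 1
              [0.4, 0.6]])

dist = np.array([1.0, 0.0])  # start surely in state 0
for _ in range(100):
    dist = dist @ P
print(dist)                  # ≈ [0.8, 0.2], the stationary distribution
```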
6.4. Wiener Process
6.4.1. Definition
6.4.1.1. Canonical Characterization
- \(W_0 = 0\)
- \(W_t\) is almost surely continuous
- \(W_t\) has independent increments: for non-overlapping intervals \(s_1 \le t_1 \le s_2 \le t_2\), the increments \(W_{t_1} - W_{s_1}\) and \(W_{t_2}-W_{s_2}\) are independent.
- \(W_t - W_s \sim \mathcal{N}(0, t - s)\) for \(0\le s \le t\), where \(\mathcal{N}(\mu, \sigma^2)\) is the normal distribution.
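- A sketch that builds Wiener paths directly from the canonical characterization, as cumulative sums of independent \(\mathcal{N}(0, dt)\) increments:
```python
# Sketch: Wiener paths built from the canonical characterization, as cumulative
# sums of independent N(0, dt) increments; W_T - W_0 ~ N(0, T) across paths.
import numpy as np

rng = np.random.default_rng(8)
n, T = 1_000, 1.0
dt = T / n

paths = np.cumsum(rng.standard_normal((50_000, n)) * np.sqrt(dt), axis=1)
print(paths[:, -1].mean(), paths[:, -1].var())  # ≈ 0 and ≈ T = 1.0
```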
6.4.1.2. Lévy Characterization
- \(W_0 = 0\)
- Almost surely continuous
- Martingale
- Quadratic variation: \( [W]_t = t \)
- This is the core property of the Itô calculus: \(dW_t^2 = dt\).
6.4.1.3. Spectral Characterization
- Sine series whose coefficients are independent \(\mathcal{N}(0,1)\) random variables. This is the result of the Kosambi-Karhunen-Loève Theorem.
6.4.2. Construction
- Scaling limit of a random walk. This is the result of Donsker's theorem.
6.4.3. Properties
- It describes the Brownian motion.
- It is the integral of white noise, which is a generalized Gaussian process.
6.5. Gaussian Process
6.5.1. Definition
A continuous stochastic process \( \{ X_t ; t\in T \}\) is Gaussian if and only if, for every finite set of indices \( t_1, \ldots, t_k \) in the index set \( T \), \[ \mathbf{X}_{t_1,\dots, t_k} = (X_{t_1}, \dots, X_{t_k}) \] is a multivariate Gaussian random variable.
6.5.2. Covariance Function
The second-order statistics completely define a Gaussian process.
The variances and covariances can be given by the covariance function \( K(x,x') \), and together with the restriction of the domain, they completely determine the probability density over functions with a continuous domain.
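A sampling sketch: on a finite grid, a Gaussian process draw is just a multivariate Gaussian draw with the kernel as covariance; the squared-exponential (RBF) kernel here is an assumed choice, not prescribed above:
```python
# Sketch: sampling a Gaussian process on a finite grid from its covariance
# function; the squared-exponential (RBF) kernel is an assumed choice.
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x = np.linspace(0.0, 5.0, 200)
K = rbf(x, x) + 1e-9 * np.eye(len(x))  # jitter for numerical stability

rng = np.random.default_rng(9)
sample = rng.multivariate_normal(np.zeros(len(x)), K)  # one draw of the process
print(sample.shape, sample[:3])
```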
6.5.3. Kriging
- Gaussian Process Regression
The prior distribution is determined by a suitable choice of hyperparameters for the covariance function (or kernel), and the observations are used to update the prior.
It is a form of Bayesian inference, using the Gaussian process as a prior probability distribution.
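A minimal kriging sketch under the same assumed RBF kernel, with noiseless observations and hypothetical data points; the posterior mean and covariance follow the standard GP-regression formulas:
```python
# A minimal kriging (GP regression) sketch: noiseless observations, RBF kernel
# as an assumed prior covariance; inputs, values, and ell are hypothetical.
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

x_obs = np.array([0.0, 1.0, 2.5])    # observed inputs
y_obs = np.array([1.0, -0.5, 0.7])   # observed values
x_new = np.linspace(0.0, 3.0, 7)     # prediction grid

K = rbf(x_obs, x_obs) + 1e-9 * np.eye(len(x_obs))
K_s = rbf(x_new, x_obs)

mean = K_s @ np.linalg.solve(K, y_obs)                     # posterior mean
cov = rbf(x_new, x_new) - K_s @ np.linalg.solve(K, K_s.T)  # posterior covariance
print(mean)  # interpolates y_obs at the observed inputs
```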
6.6. Properties
- A stochastic process \( (X_t)_{t\in T} \) on the probability space \( (\Omega, \Sigma, \mathrm{P}) \)
generates the natural filtration \( (\Sigma_t)_{t\in T} \) of \( \Sigma \):
\[
\Sigma_t := \sigma(X_k \mid k \le t).
\]
- The canonical probability space is the product space of the state spaces of all the \( X_t \), \( t\in T \).
7. References
- Marginal distribution - Wikipedia
- Conditional probability distribution - Wikipedia
- Bayesian probability - Wikipedia
- Probability space - Wikipedia
- Random variable - Wikipedia
- Probability distribution - Wikipedia
- Cauchy distribution - Wikipedia
- Exponential distribution - Wikipedia
- Beta distribution - Wikipedia
- Beta prime distribution - Wikipedia
- The Beta Distribution : Data Science Basics - YouTube
- Poisson point process - Wikipedia
- Quadratic variation - Wikipedia
- Martingale (probability theory) - Wikipedia
- Local martingale - Wikipedia
- Brownian motion - Wikipedia
- Wiener process - Wikipedia
- Gaussian process - Wikipedia
- Gaussian Processes - YouTube